Analysis of the South Slavic Scripts by Run-Length
Features of the Image Texture

Abstract

The paper proposes an algorithm for script recognition based on texture characteristics.
The image texture is obtained by coding each letter with the equivalent script type (number code)
according to its position in the text line. Each code is transformed into an equivalent gray level
pixel, creating a 1-D image. Then, the image texture is subjected to run-length analysis.
This analysis extracts the run-length features, which are classified to distinguish between
the scripts under consideration. In the experiment, the proposed algorithm is applied to a
custom-oriented database. The database consists of text documents written in Cyrillic, Latin
and Glagolitic scripts, and it is divided into training and test parts. The results of
the experiment show that 3 out of 5 run-length features can be used for effective differentiation
between the analyzed South Slavic scripts.

Index Terms: Classification, Coding, Image Analysis, Script, Texture.


1 Introduction 

The Balkan region, which is populated by South Slavs, is very rich in cultural heritage
from the medieval age. One of its most important cultural achievements is the variety of
scripts in use. In the medieval age, South Slavs spoke the Old Church Slavonic language. It
was written in the Glagolitic alphabet called the round Glagolitic script, which was later replaced
by Cyrillic in the eastern Balkans, i.e. in Bulgaria and Macedonia. In Bosnia, a local
version of the Cyrillic alphabet was used, while in Croatia a variant of the Glagolitic alphabet
called the squared Glagolitic script was preserved. Accordingly, books from the medieval age were
written in all of the aforementioned scripts. Currently, Serbian is the only European standard
language with complete synchronic digraphia, using both the Cyrillic and Latin alphabets.

The Serbian language has been studied in natural speech analysis as a part of the South
Slavic language group. Similarly, the recognition of the South Slavic scripts can be used
in document image analysis (DIA) and optical character recognition (OCR) [9]. Recognition of
different script characters in an OCR module is a difficult task [6], because character
recognition depends on features such as the structural properties, style and nature of writing,
which generally differ from one script to another. This is especially true for the digitization and
script recognition of old books from the Balkan region, which are written in different scripts,
i.e. Cyrillic, Glagolitic and Latin.

Script recognition techniques are classified as global or local. Global methods treat
the document image as a group of large image blocks, which are then statistically
analyzed. The drawback of these methods is their sensitivity to noisy images, which can
degrade the recognition results [3]. Local approaches segment document images into small blocks
representing pieces of text, i.e. connected components. These are then subjected to statistical
analysis such as run-length, co-occurrence, local binary patterns, etc. [10]. The efficiency of the
local methods is unaffected by noise in the document image. Unfortunately, they are more
computationally intensive than global methods.

This paper presents a method similar to the local methods. It is based on coding each letter
in the script by its position in the text line. Accordingly, the method takes into
account the energy profile of each script sign in a text line and classifies it into one of four
groups [6]. Hence, the efficiency of the method is linked with the basis of the printed text. In
this way, the number of variables is considerably reduced compared to the initial number.
The resulting codes are translated into gray-level pixels of an image. Then, the texture of
the image is subjected to run-length analysis. The extracted run-length features are the basis
for distinguishing the scripts. This goal is accomplished by the classification tool GA-ICDA, an
extension of the GA-IC framework [1], which segments the data into clusters representing different
scripts.

The remainder of the paper is organized as follows. Section 2 addresses all aspects of
the proposed algorithm, including the text-line definition and script modeling. The result
is the coded text, which corresponds to a 1-D image with four gray levels. This image is
subjected to run-length analysis of the image texture, and the obtained features are then
classified by the clustering algorithm. Section 3 describes the experiment and the custom-oriented
database, which consists of training and test document sets written in different scripts. Section 4
gives the results of the experiment and discusses them. Section 5 draws conclusions and points out
further research directions.


2 Algorithm 



Figure 1: The stages of the proposed algorithm. 


The algorithm consists of the following three stages: (i) script coding, (ii) texture analysis,
and (iii) classification. Script coding translates each letter into a gray-level pixel of the image
according to its position in the text line. Then, texture analysis processes the resulting 1-D
image to extract run-length texture features. These features are passed to the classification
stage, which clusters them into classes representing documents written in different scripts. Fig. 1
shows the stages of the proposed algorithm.

2.1 Script Coding

Document text can be segmented into text lines. Furthermore, taking into account the energy
of the script signs [5], each text line can be separated by the following virtual lines: (i)
top-line, (ii) upper-line, (iii) base-line, and (iv) bottom-line. These lines divide the text line
area into three vertical zones: (i) upper zone, (ii) middle zone, and (iii) lower zone.

The letters can be classified depending on their position in the vertical zones of the line, which
reflects their energy profile. The short letters (S) occupy the middle zone only. The ascender
letters (A) spread over the middle and upper zones. The descender letters (D) extend into
the middle and lower zones. The full letters (F) spread over all vertical zones. Hence, all
letters can be grouped into four different script types. Fig. 2 shows the script characteristics
according to the letter baseline position.


Figure 2: Illustration of the virtual lines and vertical zones in the text line.


Each of the script types can be replaced with a different number code: 0, 1, 2, 3. Since
there exist only four script types, there are only four number codes. Furthermore, to create a
texture, these numbers are transformed into different levels of gray. Fig. 3 shows the equivalence
between script type number codes and gray levels.

Figure 3: Script type equivalent number codes and corresponding gray levels.



In this way, each text is transformed into a sequence of number codes (0, 1, 2, 3) and then into
pixels of only four gray levels. The obtained image is a 1-D image $I_m$ representing a texture,
which can be subjected to texture analysis.
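
The coding step can be illustrated with a minimal Python sketch. The letter sets, the assignment
of the codes 0-3 to the script types S, A, D and F, and the gray-level values are assumptions made
for illustration only; the paper does not list them. The helper names `script_type`, `encode_text`
and `to_gray_image` are hypothetical, and only lowercase Latin letters are handled.

```python
# Hypothetical letter classes for the lowercase Latin alphabet; the paper
# does not list the letters of each class, so these sets are assumptions.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULL = set("j")  # assumed: 'j' spans the upper, middle and lower zones

# Assumed assignment of number codes to the script types S, A, D and F.
CODE = {"S": 0, "A": 1, "D": 2, "F": 3}

# Assumed, evenly spaced 8-bit gray levels for the four number codes.
GRAY = {0: 0, 1: 85, 2: 170, 3: 255}


def script_type(letter):
    """Return 'S', 'A', 'D' or 'F' for a single lowercase letter."""
    if letter in FULL:
        return "F"
    if letter in ASCENDERS:
        return "A"
    if letter in DESCENDERS:
        return "D"
    return "S"


def encode_text(text):
    """Map each alphabetic character to its number code, skipping the rest.

    Uppercase letters are lowercased first, which is a simplification.
    """
    return [CODE[script_type(ch)] for ch in text.lower() if ch.isalpha()]


def to_gray_image(codes):
    """Turn the code sequence into a 1-D image of gray-level pixels."""
    return [GRAY[c] for c in codes]


codes = encode_text("script type")
image = to_gray_image(codes)
print(codes)
print(image)
```

Under these assumptions a line of text becomes a list with one pixel per letter, which is the
1-D texture that the next stage analyzes.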

2.2 Texture Analysis 

Texture is a measure of the intensity variation in the image surface. Hence, it is used to extract
information from images that can quantify properties such as smoothness, coarseness, and
regularity. In this way, the texture is characterized by statistical measures
calculated from the grayscale intensities in the image. These statistical measures can be
used for classification and segmentation of the image [7].

Run-length statistical analysis is used to evaluate and quantify the coarseness of a texture.
A run is defined as a set of consecutive pixels with the same gray
level intensity in a specific direction. Typically, coarse textures contain long runs with different
gray level intensities, while fine textures include short runs with similar gray level intensities.

Let us suppose that we have an image $I_m$ featuring $X$ rows, $Y$ columns and $M$ levels of gray
intensity. The starting point is the extraction of the run-length matrix $p(i, j)$. It is defined as
the number of runs with pixels of gray level $i = 1, \ldots, M$ and run length $j = 1, \ldots, N$,
where $M$ is the number of gray levels, while $N$ is the maximum run length. In
our case, $I_m$ is a 1-D image, so each element $p(i, j)$ gives the total number of occurrences of
gray-level runs of length $j$ and intensity value $i$ along the text line. A set of consecutive
pixels with identical or similar intensity values constitutes
a gray level run. Furthermore, various texture matrices and vectors representing semi-features
can be extracted from the run-length matrix $p(i, j)$ [13]: (i) gray level run-length pixel number
matrix $p_p$, (ii) gray level run-number vector $p_g$, (iii) run-length run-number vector $p_r$, and
(iv) gray level run-length-one vector $p_o$.

Gray level run-length pixel number matrix $p_p$ is defined as:

$$p_p(i, j) = p(i, j) \cdot j \tag{1}$$

Gray level run-number vector $p_g$ is calculated as:

$$p_g(i) = \sum_{j=1}^{N} p(i, j) \tag{2}$$

Run-length run-number vector $p_r$ is given as:

$$p_r(j) = \sum_{i=1}^{M} p(i, j) \tag{3}$$

Gray level run-length-one vector $p_o$ is:

$$p_o(i) = p(i, 1) \tag{4}$$
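
As an illustration of eqs. (1)-(4), the following Python sketch builds the run-length matrix of a
1-D image and derives the four semi-features from it. It is a minimal sketch under stated
assumptions: the function names are hypothetical, gray levels are represented by the number codes
0-3 and indexed from 0 (the text indexes them from 1), and the input sequence is an invented
example rather than data from the experiment.

```python
from itertools import groupby


def run_length_matrix(image, num_levels):
    """Return p as a list of rows, where p[i][j - 1] counts runs of level i and length j.

    `image` is a 1-D sequence of gray-level indices in range(num_levels).
    """
    # groupby collapses consecutive equal pixels into (level, run length) pairs.
    runs = [(level, len(list(group))) for level, group in groupby(image)]
    max_len = max((length for _, length in runs), default=0)
    p = [[0] * max_len for _ in range(num_levels)]
    for level, length in runs:
        p[level][length - 1] += 1
    return p


def semi_features(p):
    """Compute p_p, p_g, p_r and p_o of eqs. (1)-(4) from the matrix p."""
    n_cols = len(p[0]) if p else 0
    p_p = [[row[j] * (j + 1) for j in range(n_cols)] for row in p]  # eq. (1)
    p_g = [sum(row) for row in p]                                   # eq. (2)
    p_r = [sum(row[j] for row in p) for j in range(n_cols)]         # eq. (3)
    p_o = [row[0] if n_cols else 0 for row in p]                    # eq. (4)
    return p_p, p_g, p_r, p_o


image = [0, 0, 0, 0, 2, 1, 1, 2, 2, 0]  # hypothetical coded text line
p = run_length_matrix(image, num_levels=4)
p_p, p_g, p_r, p_o = semi_features(p)
print(p)
print(p_g, p_r, p_o)
```

A useful sanity check is that the entries of $p_p$ sum to the number of pixels, since every pixel
belongs to exactly one run.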

Using the aforementioned matrix and vectors, the following run-length features were originally
proposed in [5]: (i) Short run emphasis (SRE), (ii) Long run emphasis (LRE), (iii) Gray-level
nonuniformity (GLN), (iv) Run length nonuniformity (RLN), and (v) Run percentage (RP).

SRE measures the distribution of short runs. It is highly dependent on the occurrence of 
short runs and is expected to be large for fine textures. It is given as: 


$$\mathrm{SRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j)}{j^2} = \frac{1}{n_r} \sum_{j=1}^{N} \frac{p_r(j)}{j^2} \tag{5}$$


LRE measures the distribution of long runs. It is highly dependent on the occurrence of long 
runs and is expected to be large for coarse structural textures. It is calculated as: 


$$\mathrm{LRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j) \cdot j^2 = \frac{1}{n_r} \sum_{j=1}^{N} p_r(j) \cdot j^2 \tag{6}$$


GLN measures the similarity of gray level values throughout the image. Its value is expected 
to be small if the gray level values are alike throughout the image. It is computed as: 


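$$\mathrm{GLN} = \frac{1}{n_r} \sum_{i=1}^{M} \left( \sum_{j=1}^{N} p(i, j) \right)^2 = \frac{1}{n_r} \sum_{i=1}^{M} p_g(i)^2 \tag{7}$$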

RLN measures the similarity of the length of runs throughout the image. It is expected to
receive smaller values if the run lengths are alike throughout the image. It is defined as:

$$\mathrm{RLN} = \frac{1}{n_r} \sum_{j=1}^{N} \left( \sum_{i=1}^{M} p(i, j) \right)^2 = \frac{1}{n_r} \sum_{j=1}^{N} p_r(j)^2 \tag{8}$$


RP measures the homogeneity and the distribution of runs of an image in a specific direction. 
It receives the largest values when the length of runs is 1 for all gray levels in a specific direction. 
It is given as: 


$$\mathrm{RP} = \frac{n_r}{n_p} \tag{9}$$

In eqs. (5)-(9), $n_r$ represents the total number of runs, while $n_p$ is the number of pixels in
the image $I_m$.
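
The five features of eqs. (5)-(9) can be sketched in Python as follows. This is an illustrative
sketch, not the authors' implementation: the function name `run_length_features` and the example
matrix are hypothetical, the matrix rows are gray levels and the columns are run lengths starting
at 1, and the matrix is assumed to contain at least one run.

```python
def run_length_features(p):
    """Compute SRE, LRE, GLN, RLN and RP from a run-length matrix p.

    p[i][j - 1] is the number of runs of gray level i with length j.
    """
    n_cols = len(p[0])
    p_g = [sum(row) for row in p]                            # runs per gray level
    p_r = [sum(row[j] for row in p) for j in range(n_cols)]  # runs per run length
    n_r = sum(p_r)                                           # total number of runs
    n_p = sum(p_r[j] * (j + 1) for j in range(n_cols))       # total number of pixels
    return {
        "SRE": sum(p_r[j] / (j + 1) ** 2 for j in range(n_cols)) / n_r,  # eq. (5)
        "LRE": sum(p_r[j] * (j + 1) ** 2 for j in range(n_cols)) / n_r,  # eq. (6)
        "GLN": sum(g ** 2 for g in p_g) / n_r,                           # eq. (7)
        "RLN": sum(r ** 2 for r in p_r) / n_r,                           # eq. (8)
        "RP": n_r / n_p,                                                 # eq. (9)
    }


# Hypothetical run-length matrix of the sequence 0,0,0,0,2,1,1,2,2,0
# (4 gray levels, maximum run length 4).
p = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
features = run_length_features(p)
for name, value in features.items():
    print(name, round(value, 4))
```

For this small example the sequence has 5 runs over 10 pixels, so RP is 0.5; a sequence in which
every run has length 1 would give the maximum value RP = 1, in line with the description above.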

2.3 Classification 

The obtained features can be analyzed in order to select only those run-length features that can
distinguish different scripts. For this purpose, we adopt an extension of the GA-IC tool in [1],
called GA-ICDA (Genetic Algorithms Image Clustering for Document Analysis). GA-IC is used
for clustering images from a database. It creates a weighted graph whose nodes
correspond to images. An edge connecting two images is established if the images
are similar. Each edge has a weight, which expresses the level of similarity computed from
the feature vectors of the two images. The graph of images is then clustered by applying a genetic
algorithm that divides the graph into groups of nodes. GA-ICDA introduces three new aspects
with respect to GA-IC. First, each graph node is a document represented as a run-length feature
vector. Second, an edge connects two nodes only if the related documents are similar and the node
distance is less than a threshold T, given an established node ordering. Finally, a refinement
procedure at the end of the genetic algorithm merges pairs of clusters with minimum distance to
each other until a fixed number of clusters is reached.
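
The graph construction described above can be illustrated with a short Python sketch. It covers
only the edge rule (similar documents whose node distance is below T) and not the genetic
algorithm or the refinement step. Several details are assumptions that do not come from the
paper: the similarity is taken as a Gaussian of the Euclidean distance between feature vectors,
"similar" is taken to mean a similarity of at least `min_similarity`, the node ordering is the
list order, and the feature vectors are invented (SRE, RP) pairs.

```python
import math


def build_document_graph(features, T, sigma=1.0, min_similarity=0.5):
    """Return weighted edges {(i, j): weight} between similar, nearby documents.

    Assumptions (not from the paper): Gaussian similarity of the Euclidean
    distance, a fixed similarity cut-off, and node distance |i - j| in list order.
    """
    edges = {}
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if j - i >= T:  # node distance must be less than the threshold T
                continue
            dist = math.dist(features[i], features[j])
            weight = math.exp(-dist ** 2 / (2 * sigma ** 2))
            if weight >= min_similarity:  # keep edges between similar documents only
                edges[(i, j)] = weight
    return edges


# Hypothetical (SRE, RP) feature vectors of four documents.
docs = [(0.60, 0.33), (0.62, 0.35), (0.71, 0.54), (0.70, 0.55)]
edges = build_document_graph(docs, T=3, sigma=0.05)
print(sorted(edges))
```

In this toy example only the pairs (0, 1) and (2, 3) are connected, so the graph already
separates the two groups of documents before any clustering is applied.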


3 Experiment 

The goal of the experiment is to demonstrate the effectiveness of the proposed algorithm for script
recognition. The Serbian and Croatian languages can be written in three different scripts: Cyrillic,
Latin and Glagolitic. Hence, they are suitable for the experiment. For this purpose,
a custom-oriented database of text documents written in Cyrillic, Latin and Glagolitic scripts
is created. The database includes two parts: a training set and a test set of documents. The
training set consists of 100 documents, while the test set consists of 15 documents. Each set
incorporates a similar number of documents in Cyrillic, Latin and Glagolitic scripts. Both sets
are subjected to the run-length statistical analysis. The obtained results are given and discussed
below.


4 Results and Discussion 

First, the training set is subjected to the run-length statistical analysis. The results are given in 
Tab. I for the Cyrillic, Latin and Glagolitic scripts. 

TABLE I. TRAINING SET OF CYRILLIC, LATIN AND GLAGOLITIC SCRIPTS.

Script       Value      SRE     LRE      GLN      RLN     RP
Cyrillic     Minimum    0.59   21.00    40.00    38.00   0.32
             Maximum    0.63   25.00   200.00   175.00   0.36
Latin        Minimum    0.70    6.00    80.00    90.00   0.52
             Maximum    0.72    7.00   400.00   390.00   0.56
Glagolitic   Minimum    0.66   10.00    60.00    65.00   0.39
             Maximum    0.68   15.00   230.00   230.00   0.46


TABLE II. TEST SET OF CYRILLIC, LATIN AND GLAGOLITIC SCRIPTS.

Script       Value      SRE      LRE       GLN        RLN        RP
Cyrillic     Minimum    0.6036   21.3960    42.2632    40.2281   0.3314
             Maximum    0.6230   23.9298   188.8730   170.5437   0.3470
Latin        Minimum    0.7043    6.3769    84.0352    92.8693   0.5344
             Maximum    0.7183    6.7704   350.5431   377.5103   0.5517
Glagolitic   Minimum    0.6600   10.2716    62.7531    69.0247   0.4036
             Maximum    0.6779   13.9696   214.4705   226.0555   0.4525


Then, the test set is subjected to the run-length statistical analysis. The results are given in 
Tab. II for the Cyrillic, Latin and Glagolitic scripts. 

The above results represent the five run-length features extracted for each script. The next
step is classification. First, the training and test set results must be compared:
if the test set ranges lie within the training set ranges, the results are considered valid.
Furthermore, we have to identify the run-length features whose values separate the
scripts. From Tabs. I-II, the GLN and RLN ranges of the different scripts overlap.
However, SRE, LRE and RP characterize each script differently, i.e. their ranges do not overlap
between scripts. Fig. 4 shows the SRE values for the test set of the scripts.


Figure 4: SRE (short run emphasis) for test set of the scripts (minimum, maximum and average values per script).

The SRE ranges, from the minimum to the maximum obtained value, are clearly
distinct for each script. Hence, SRE is suitable for differentiation between the scripts. Fig. 5
shows the LRE values for the test set of the scripts.


Figure 5: LRE (long run emphasis) for test set of the scripts (Cyrillic, Latin, Glagolitic).


The LRE ranges of the scripts do not overlap either. Accordingly, LRE is also
suitable for distinction between scripts. Fig. 6 shows the RP values for the test set of the scripts.


Figure 6: RP (run percentage) for test set of the scripts (minimum, maximum and average values per script).


The RP ranges of the scripts are also clearly distinct from each other. Hence, the combination of
SRE, LRE and RP can be used to distinguish the characteristics of different scripts, such as the
South Slavic scripts Cyrillic, Latin and Glagolitic. This represents a much simpler solution than
those given in [3]. GA-ICDA is used to classify the texture features and cluster the data.
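
The two checks used in this discussion, namely that the test ranges lie within the training
ranges and that a feature separates the scripts when its ranges do not overlap, can be reproduced
from Tables I and II with the following Python sketch. The values are copied from the tables; the
function names are hypothetical, and treating ranges as closed intervals is an assumption.

```python
from itertools import combinations

# (minimum, maximum) per script and feature, copied from Tables I and II.
TRAIN = {
    "Cyrillic": {"SRE": (0.59, 0.63), "LRE": (21.0, 25.0), "GLN": (40.0, 200.0),
                 "RLN": (38.0, 175.0), "RP": (0.32, 0.36)},
    "Latin": {"SRE": (0.70, 0.72), "LRE": (6.0, 7.0), "GLN": (80.0, 400.0),
              "RLN": (90.0, 390.0), "RP": (0.52, 0.56)},
    "Glagolitic": {"SRE": (0.66, 0.68), "LRE": (10.0, 15.0), "GLN": (60.0, 230.0),
                   "RLN": (65.0, 230.0), "RP": (0.39, 0.46)},
}
TEST = {
    "Cyrillic": {"SRE": (0.6036, 0.6230), "LRE": (21.3960, 23.9298),
                 "GLN": (42.2632, 188.8730), "RLN": (40.2281, 170.5437),
                 "RP": (0.3314, 0.3470)},
    "Latin": {"SRE": (0.7043, 0.7183), "LRE": (6.3769, 6.7704),
              "GLN": (84.0352, 350.5431), "RLN": (92.8693, 377.5103),
              "RP": (0.5344, 0.5517)},
    "Glagolitic": {"SRE": (0.6600, 0.6779), "LRE": (10.2716, 13.9696),
                   "GLN": (62.7531, 214.4705), "RLN": (69.0247, 226.0555),
                   "RP": (0.4036, 0.4525)},
}


def is_subrange(inner, outer):
    """True if the closed interval `inner` lies within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def separating_features(ranges):
    """Return the features whose ranges are pairwise disjoint across scripts."""
    scripts = list(ranges)
    features = list(ranges[scripts[0]])
    return [f for f in features
            if not any(overlaps(ranges[a][f], ranges[b][f])
                       for a, b in combinations(scripts, 2))]


test_within_train = all(is_subrange(TEST[s][f], TRAIN[s][f])
                        for s in TRAIN for f in TRAIN[s])
print(test_within_train)
print(separating_features(TRAIN), separating_features(TEST))
```

Run on the tabulated values, the check confirms that every test range is contained in the
corresponding training range and that SRE, LRE and RP are the features with non-overlapping
ranges in both sets.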

Our goal is to build a model that correctly predicts the classes representing documents
written in different scripts. Hence, precision, recall and F-measure are the preferred metrics for
evaluating the proposed algorithm. Precision is the fraction of retrieved documents that are
relevant. Recall is the fraction of relevant documents that are
retrieved. F-measure is the harmonic mean of precision
and recall. Classification by GA-ICDA on the test set established 3 communities, with each
document correctly assigned to its class, as given in Table III.

Hence, precision, recall and F-measure receive a value of 1. In this way, the proposed algorithm
correctly assigns each document from the database to the class
representing the Cyrillic, Latin or Glagolitic script (3 classes). Comparison with two other
well-known classifiers, Hierarchical Clustering and Expectation-Maximization, on the same run-length
coded test set shows the superiority of our approach. Each algorithm was executed 100
times, and the average values together with the standard deviations (in parentheses) are reported.
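
The three metrics can be written as a short Python sketch. The function name and the document
counts below are hypothetical; the counts are chosen only to show how a cluster that contains all
relevant documents together with an equal number of irrelevant ones is scored.

```python
def precision_recall_f(retrieved_relevant, retrieved, relevant):
    """Return (precision, recall, F-measure) from document counts.

    retrieved_relevant: relevant documents among those retrieved
    retrieved: all documents retrieved (assigned to the class)
    relevant: all documents that truly belong to the class
    """
    precision = retrieved_relevant / retrieved
    recall = retrieved_relevant / relevant
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical case: 5 relevant documents, all of them inside a cluster of 10.
precision, recall, f_measure = precision_recall_f(5, 10, 5)
print(round(precision, 4), round(recall, 4), round(f_measure, 4))
```

For these hypothetical counts the precision is 0.5, the recall is 1 and the F-measure is about
0.6667, the same combination of values that Table III reports for Hierarchical Clustering on the
Latin and Glagolitic scripts.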


TABLE III. CLASSIFICATION RESULTS ON TEST SET.

Classifier                 Script   Precision         Recall            F-Measure
GA-ICDA                    Cyr.     1.0000 (0.0000)   1.0000 (0.0000)   1.0000 (0.0000)
                           Latin    1.0000 (0.0000)   1.0000 (0.0000)   1.0000 (0.0000)
                           Glag.    1.0000 (0.0000)   1.0000 (0.0000)   1.0000 (0.0000)
Expectation-Maximization   Cyr.     0.7479 (0.2840)   0.9780 (0.0986)   0.8196 (0.2140)
                           Latin    0.6564 (0.2374)   0.9700 (0.1040)   0.7515 (0.1529)
                           Glag.    0.4774 (0.0414)   0.9800 (0.0603)   0.6408 (0.0449)
Hierarchical Clustering    Cyr.     1.0000 (0.0000)   1.0000 (0.0000)   1.0000 (0.0000)
                           Latin    0.5000 (0.0000)   1.0000 (0.0000)   0.6667 (0.0000)
                           Glag.    0.5000 (0.0000)   1.0000 (0.0000)   0.6667 (0.0000)


5 Conclusions 

The manuscript proposed a methodology for script recognition in South Slavic documents.
It uses run-length statistical analysis of the document based on the position of each
script element in the text line. Due to the differences in script characteristics, the results
of the statistical analysis show significant dissimilarity, which represents a starting point for
feature classification. The proposed method is tested on documents written in Cyrillic, Latin and
Glagolitic scripts. The experiments show encouraging results. Further research will be
directed toward the statistical analysis of other scripts.